# SDF-HF
This repository contains the code required to replicate the paper, "News and Asset Pricing: A High-Frequency Anatomy of the SDF" (Aleti and Bollerslev, 2024). A description of the code is provided below. 

## Paper Information
**Abstract:** Utilizing real-time newswire data together with a robustly estimated intraday Stochastic Discount Factor (SDF), we identify and quantify the economic news that is priced. News related to monetary policy and finance on average accounts for most of the variation in the SDF, followed by news about international affairs and macroeconomic data. We also document non-trivial temporal variation in the relative importance of the news, along with marked differences in the estimated news risk premia in the “factor zoo.” Further highlighting the economic mechanisms at work, we associate the different news effects with interest rate, growth, and risk premium shocks.

**Citation:** Aleti, S., & Bollerslev, T. (2024). News and Asset Pricing: A High-Frequency Anatomy of the SDF. *Review of Financial Studies, forthcoming*.

## Directory structure
The directory structure of the repository is as follows:

* `code/` contains the code for the analysis
* `data/` contains the raw and processed data
* `results/` contains the results of the SDF estimation and analysis

### Zip Files
For the zipped version of the code, the instructions are as follows:

* `code.7z`: unzip in `./code`
* `data.7z.###`: unzip in `./data`, note that files are split due to size constraints
* `docs.7z`: unzip in `./docs`
* `results.7z`: unzip in `./results`

### Tree
The directory tree is given below. The unzipped contents should match the structure below.
```
├── code
│   ├── _config
│   ├── analysis_descriptive
│   ├── analysis_lda
│   ├── analysis_sdf
│   ├── estimation_sdf
│   ├── helper_libraries
│   ├── identification_news
│   └── preparation_data
├── data
│   ├── bbdk
│   ├── bkmx
│   ├── calendars
│   ├── cpz
│   ├── crsp
│   │   ├── daily
│   │   ├── dlret
│   │   └── me
│   ├── dowjones
│   │   ├── filter
│   │   └── preproc
│   ├── etc
│   ├── ff
│   ├── jkp
│   ├── jln
│   ├── keys
│   ├── mcng
│   ├── proc
│   │   ├── dtms
│   │   ├── dtms_agg
│   │   ├── dtms_agg_5min
│   │   ├── dtms_agg_rolling
│   │   ├── dtms_agg_rolling_5min
│   │   ├── factor_returns
│   │   │   ├── all
│   │   │   ├── cz
│   │   │   ├── ff6
│   │   │   ├── ind48
│   │   │   └── jkp
│   │   ├── factor_returns_with_jumpind
│   │   │   └── hfzoo
│   │   ├── factor_returns_with_jumpind_5min_35
│   │   │   └── hfzoo
│   │   ├── macro_series
│   │   ├── ngrams
│   │   ├── tickdata
│   │   ├── topic_weights
│   │   ├── topic_weights_5min
│   │   ├── topic_weights_extended_1
│   │   ├── topic_weights_extended_2
│   │   └── topic_weights_extended_3
│   ├── taq
│   ├── tickdata
│   └── wg
├── docs
└── results
    ├── condfactor_pricing
    ├── lda_analysis
    └── sdf_estimate_main
        └── variable_importance
            ├── avgabsgrad
            └── loo
```
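
After unzipping, a quick check of the folder layout can catch archives that were extracted to the wrong place. The snippet below is a minimal sketch and not part of the repository: the helper `missing_directories` is hypothetical, and `EXPECTED_DIRS` lists only a subset of the folders from the tree above.

```python
from pathlib import Path

# A subset of the folders shown in the tree above (extend as needed).
EXPECTED_DIRS = [
    "code/_config",
    "code/analysis_descriptive",
    "code/analysis_lda",
    "code/analysis_sdf",
    "code/estimation_sdf",
    "code/helper_libraries",
    "code/identification_news",
    "code/preparation_data",
    "data/proc/factor_returns/all",
    "data/proc/topic_weights",
    "docs",
    "results/sdf_estimate_main/variable_importance/avgabsgrad",
    "results/sdf_estimate_main/variable_importance/loo",
]


def missing_directories(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Run from the repository root; prints nothing if all folders are present.
    for rel in missing_directories("."):
        print(f"missing: {rel}")
```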

## Data 
The following folders contain proprietary data that has been excluded/subsampled/anonymized:
* `data/crsp`: proprietary data from CRSP
	* a sample of the data in each subfolder is included 
	* samples are a single observation from 2020-01
	* permnos, permcos, and cusips have been anonymized
* `data/dowjones`: proprietary data from Dow Jones (Newswires)
	* for the `raw` subfolder, there is no sample data included for legal reasons
	* for the `filter` and `preproc` folders, pseudo-data is included to highlight the structure of the data files 
* `data/tickdata`
	* subsamples of the data for the TU and TY futures are included
	* the price data in both samples are mixed with random numbers to anonymize 
	* a subsample of the duration data is also included
* `data/proc/dtms*`
	* these are various files containing the document-term matrices (dtms)
	* anonymized data samples are included instead
* `data/proc/topic_weights*`
	* these are various files containing the topic weights 
	* anonymized data samples are included instead
* `data/proc/ngrams`
	* this folder contains the ngrams for each of the filtered articles
	* pseudo-data is included instead
* `data/proc/tickdata`
	* this folder contains the cleaned futures yields data 
	* pseudo-data is included instead 

## Code
Below is a description of the code in the repository. The code was run with Python `3.6.13`, and the required packages are listed in the `requirements.txt` file. Much of the execution also relies on a high-performance computing cluster managed by the SLURM Workload Manager. 

### Data Preparation
The code for data preparation is in the `code/preparation_data` folder. The order of operations to run the code is as follows:

* `merge_factors.ipynb`
  * merges the factor return data from the High-Frequency Factor Zoo project
  * subtracts the risk-free rate from the market and industry factors so that they are zero-net-investment (excess return) portfolios
  * keep in mind returns are percentage returns
  * output is in `data/proc/factor_returns/all`
* `factor_jump_detection.ipynb`
  * detects jumps in the factor returns
  * these returns are log returns
  * saves the split factors to `data/proc/factor_returns_with_jumpind`
* `clean_macro.ipynb`
  * cleans the macroeconomic data
  * can be run at any point
  * saves the cleaned data to `data/proc/macro_series`
* `preprocess_djn.ipynb`
  * preprocesses the Dow Jones News data
  * saves the cleaned data to `data/dowjones/preproc`
* `clean_tickdata.ipynb`
  * loads raw tickdata files from `data/tickdata`
  * cleans up prices to obtain yields
  * saves to `data/proc/tickdata/`
* `prepare_mp_calendar.ipynb`
  * prepares monetary policy calendar by combining existing calendars
  * produces basic descriptive stats
  * saves output to `data/calendars`
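
The first two notebooks work in different return units (percentage returns in `merge_factors.ipynb`, log returns in `factor_jump_detection.ipynb`). The following is a minimal sketch of that unit handling, not the notebooks' actual code. All function names are hypothetical, and the jump rule (a multiple `k` of the median absolute return) is only a placeholder assumption, not the detection procedure used in the paper.

```python
import math
from statistics import median


def excess_returns(factor_pct, rf_pct):
    """Subtract the risk-free rate (both in percent) period by period."""
    return [r - rf for r, rf in zip(factor_pct, rf_pct)]


def pct_to_log(returns_pct):
    """Convert simple percentage returns to log returns."""
    return [math.log1p(r / 100.0) for r in returns_pct]


def flag_jumps(log_returns, k=5.0):
    """Flag returns larger in magnitude than k times the median absolute return.

    Placeholder rule for illustration only (an assumption, not the paper's test).
    """
    scale = median(abs(r) for r in log_returns)
    return [abs(r) > k * scale for r in log_returns]


# Made-up numbers: one large move among otherwise small returns.
market_pct = [0.10, -0.05, 0.08, 2.50, -0.07]
rf_pct = [0.01] * len(market_pct)
log_excess = pct_to_log(excess_returns(market_pct, rf_pct))
jump_flags = flag_jumps(log_excess)
```

Keeping the conversion in one place makes it harder to mix percentage and log returns by accident when the outputs are combined later.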

### News Identification
The code for news identification is in the `code/identification_news` folder. The order of operations to run the code is as follows:

* `filter_djn_array.py`
  * filters the Dow Jones News data for each date
  * parallelized over months
  * saves the filtered data to `data/dowjones/filter`
* `construct_dtms_array.py`
  * constructs the document-term matrices for each date
  * uses the filtered dow jones data
  * parallelized over months
  * saves the dtms to `data/proc/dtms` 
* `aggregate_dtms.ipynb`
  * aggregates the constructed dtms up to the desired frequency
  * also produces aggregated dtms for 5min robustness check
  * at the moment, the files in `dtms_agg_rolling` are the same as those in `dtms_agg`
  * saves the aggregated dtms to `data/proc/dtms_agg`, `data/proc/dtms_agg_rolling`, and `data/proc/dtms_agg_rolling_5min`
* `construct_topic_weights.ipynb`
  * constructs the topic weights for each date
  * uses the aggregated dtms
  * saves the topic weights to `data/proc/topic_weights` and `data/proc/topic_weights_5min` 
* `freq_topic_precision.ipynb`
  * loads calendars and dtms
  * studies how the topic weights depend on the sparsity parameter
  * also does a low versus high frequency comparison
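
The construction and aggregation steps above can be illustrated with a small, self-contained sketch. It is not the code in `construct_dtms_array.py` or `aggregate_dtms.ipynb`: the whitespace tokenizer, the vocabulary, the function names, and the choice to floor timestamps to the start of each bin are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def document_term_counts(text, vocabulary):
    """Count occurrences of vocabulary terms in one article (whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t in vocabulary)


def aggregate_counts(articles, vocabulary, minutes=5):
    """Sum per-article term counts into fixed-width intraday time bins.

    `articles` is an iterable of (timestamp, text) pairs; each timestamp is
    floored to the start of its bin (an assumed convention).
    """
    bins = defaultdict(Counter)
    for timestamp, text in articles:
        floored = timestamp.replace(
            minute=timestamp.minute - timestamp.minute % minutes,
            second=0,
            microsecond=0,
        )
        bins[floored] += document_term_counts(text, vocabulary)
    return dict(bins)


# Made-up headlines, for illustration only.
articles = [
    (datetime(2020, 1, 2, 9, 31, 10), "Fed raises rates"),
    (datetime(2020, 1, 2, 9, 33, 0), "Rates and inflation"),
    (datetime(2020, 1, 2, 9, 36, 0), "Inflation data"),
]
aggregated = aggregate_counts(articles, {"fed", "rates", "inflation"}, minutes=5)
```

Changing `minutes` gives the same aggregation at another frequency, which is the idea behind producing a separate 5-minute set for the robustness check.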

### Descriptive Analysis
The code for the descriptive analysis is in `code/analysis_descriptive`. There is no order of operations to run the code, but the analysis itself depends on the data preparation results produced by the code in `code/preparation_data`. The code in `code/analysis_descriptive` is as follows:

* `topic_tables.ipynb`
  * generates basic stats about the topic assignment spreadsheet
  * produces latex tables from this spreadsheet
* `djn_summary_stats.ipynb`
  * loads the preproc dow jones data and the dtms
  * generates various summary statistics

### SDF Estimation
The code for SDF estimation is in the `code/estimation_sdf` folder. The order of operations to run the code is as follows:

* `prepare_estimation.ipynb` 
  * sets up the test and span asset lists
  * saves to `data/keys`
* `learn_sdf_yearly_array.py`
  * run as slurm job
  * generates sdf estimates for each hyperparameter choice and year in the data
  * the chosen year is the "out-of-sample" year
  * output is in `results/sdf_estimates_rolling`
  * models themselves are saved in the `fit_params.checkpoint_folder` specified in `../_config/main.yaml`
* `aggregate_sdf_estimates_array.py`
  * run as slurm job
  * combines the estimates for each year into a single file
  * will run even if there is missing output, so check for missing estimates first
  * log file will warn for missing output
* `construct_validated_sdf.ipynb`
  * loads all the sdf estimates and selects the best one for each year
  * validation procedure is described in the paper
  * output is in `results/sdf_estimates`
* `compute_varimp_avgabsgrad_array.py` 
  * run as slurm job
  * computes variable importance measure for each macro variable
  * saves results to `results/sdf_estimate_main/variable_importance/avgabsgrad_*`
* `compute_varimp_loo_array.py` 
  * run as slurm job
  * computes leave-one-out importance for each macro variable
  * saves results to `results/sdf_estimate_main/variable_importance/loo_*`
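
Because the aggregation step runs even when some yearly estimates are missing, it can help to list the missing (year, hyperparameter) combinations before aggregating. The sketch below shows one common way to map a SLURM array index to such a combination and to check for missing output files. It is an assumption-laden illustration, not the repository's code: the function names, the hyperparameter labels, and the file name pattern are hypothetical. Only the `SLURM_ARRAY_TASK_ID` environment variable is standard SLURM.

```python
import os
from itertools import product
from pathlib import Path


def task_grid(years, hyperparams):
    """Enumerate every (out-of-sample year, hyperparameter) pair in a fixed order."""
    return list(product(years, hyperparams))


def task_for_index(index, years, hyperparams):
    """Return the (year, hyperparameter) pair handled by one array task."""
    return task_grid(years, hyperparams)[index]


def missing_outputs(out_dir, years, hyperparams, pattern="sdf_{year}_{hp}.parquet"):
    """List the (year, hyperparameter) pairs whose output file does not exist.

    The file name pattern is hypothetical; adapt it to the actual outputs.
    """
    out_dir = Path(out_dir)
    return [
        (year, hp)
        for year, hp in task_grid(years, hyperparams)
        if not (out_dir / pattern.format(year=year, hp=hp)).exists()
    ]


if __name__ == "__main__":
    years = list(range(2000, 2003))   # hypothetical sample years
    hyperparams = ["hp_a", "hp_b"]    # hypothetical hyperparameter labels
    index = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print("this task:", task_for_index(index, years, hyperparams))
    print("missing:", missing_outputs("results/sdf_estimates_rolling", years, hyperparams))
```

With this layout the array would be submitted with one index per grid entry, that is, indices `0` through `len(years) * len(hyperparams) - 1`.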

### SDF Analysis
The code for SDF analysis is in the `code/analysis_sdf` folder. There is no order of operations to run the code, but the analysis itself depends on the estimation results produced by the code in `code/estimation_sdf`. The code in `code/analysis_sdf` is as follows:

* `describe_sdf.ipynb`
  * loads sdf returns and factor returns
  * main descriptive analysis of the SDF estimates
  * computes alphas, returns, volatility, correlations
  * relies on `get_condfactor_pricing_errors.py` for a conditional pricing error analysis
* `get_condfactor_pricing_errors.py`
  * computes pricing errors on randomly generated conditional factors
  * run as a job script
  * output is in `results/condfactor_pricing`
* `estimate_topic_premia.ipynb`
  * loads sdf returns and topic weights
  * estimates and plots topic risk premia
* `estimate_factor_topic_decomp.ipynb`
  * loads sdf estimates, factor returns, news topic weights, and shock classifications
  * decomposes the factor risk premia by topic
  * and decomposes the factor risk premia by news shock
* `freq_5minresults.ipynb`
  * reconstructs sdf at 5-min frequency
  * loads 5-min news data
  * reproduces topic results using 5-min data as a robustness check
* `freq_excess_volatility.ipynb`
  * loads 1-min sdf estimates and calendar data
  * computes the excess volatility around events
* `freq_jump_detection.ipynb`
  * loads sdf returns and calendar data
  * detects jumps in the sdf returns across frequencies and compares
  * produces examples of detection errors
* `freq_return_precision.ipynb`
  * loads sdf returns and calendar data
  * compares high-freq jump returns with low-freq returns on same day
* `study_event_examples.ipynb`
  * loads sdf returns and topic weights
  * plots examples of event headlines associated with jumps
* `study_shocks.ipynb`
  * loads sdf returns, yield data, and topic weights
  * connects jumps with different shocks
  * estimates premia for shocks and news-shock links
  * creates `topic_class.parquet` and `shock_classifications.parquet` files
* `study_topic_monetary_policy.ipynb`
  * loads sdf returns, calendar data, and topic weights
  * decomposes premia for monetary policy topic
  * also relies on `topic_class.parquet` for shock links
* `study_jump_topic_frequency.ipynb`
  * loads sdf returns and topic weights
  * compares the frequency of jumps across topics
* `visualize_sdf_jumps.ipynb`
  * loads sdf returns
  * visualizes the jump detection procedure
  * provides basic stats about jumps
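
As an illustration of the kind of calculation behind an event-based comparison such as the one in `freq_excess_volatility.ipynb`, the sketch below compares the mean squared return inside windows around event times with the mean squared return outside them. This is a generic, assumed definition of "excess volatility" for illustration. The notebook's actual measure, window length, and data layout are not documented here, and the function name is hypothetical.

```python
def excess_volatility(returns, event_indices, window=5):
    """Ratio of the mean squared return inside event windows to that outside.

    `returns` is a sequence of equally spaced (e.g. 1-minute) returns and
    `event_indices` gives the positions of events in that sequence. A window
    covers `window` observations on each side of an event.
    """
    in_window = set()
    for e in event_indices:
        in_window.update(range(max(0, e - window), min(len(returns), e + window + 1)))
    inside = [r * r for i, r in enumerate(returns) if i in in_window]
    outside = [r * r for i, r in enumerate(returns) if i not in in_window]
    if not inside or not outside:
        raise ValueError("need observations both inside and outside the event windows")
    return (sum(inside) / len(inside)) / (sum(outside) / len(outside))


# Made-up returns: a single large move at position 5.
returns = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1, 0.1, 0.1, 0.1]
ratio = excess_volatility(returns, event_indices=[5], window=1)
```

A ratio above one indicates that returns are more volatile around the events than at other times. The function raises `ZeroDivisionError` if all returns outside the windows are exactly zero.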

### LDA Analysis
The code for the LDA analysis is in the `code/analysis_lda` folder. The order of operations to run the code is as follows:

* `prepare_ngrams_array.py`
  * prepares the ngrams for the underlying news data
  * run as a job script
  * output is in `data/proc/ngrams`
* `lda_validation_array.py`
  * runs LDA over a range of topic counts
  * saves the perplexities of the resulting models
  * output is in `results/lda_validation`
* `lda_training_main.ipynb`
  * loads the ngrams 
  * loads the validation output to pick the optimal topic count
  * trains the LDA model using said optimal topic count
  * saves the fitted model to `results/lda_analysis/main_estimate.bin`
* `lda_reclustering.ipynb`
  * loads the fitted lda model
  * performs clustering on the topics
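
The step that picks the topic count from the validation output can be sketched as follows. This is a minimal illustration under the assumption that the topic count with the lowest validation perplexity is chosen. The actual selection rule in `lda_training_main.ipynb` may differ, and the function name and the numbers in the example are made up.

```python
def select_topic_count(perplexities):
    """Return the topic count with the lowest validation perplexity.

    `perplexities` maps a topic count to its validation perplexity. Ties are
    broken in favor of the smaller topic count.
    """
    if not perplexities:
        raise ValueError("no validation results supplied")
    return min(perplexities, key=lambda k: (perplexities[k], k))


# Made-up perplexities for illustration only.
validation = {50: 1310.2, 100: 1254.7, 150: 1249.9, 200: 1261.3}
best_topic_count = select_topic_count(validation)
```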
  
